Hierarchy and Feedback in the Evolution of the E. coli Transcription Network. 
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The E.coli transcription network has an essentially feedforward structure, with, however, abundant feedback 
at the level of self-regulations. Here, we investigate how these properties emerged during evolution. An as- 
sessment of the role of gene duplication based on protein domain architecture shows that (i) transcriptional 
autoregulators have mostly arisen through duplication, while (ii) the expected feedback loops stemming from 
their initial cross-regulation are strongly selected against. This requires a divergent coevolution of the tran- 
scription factor DNA-binding sites and their respective DNA cis-regulatory regions. Moreover, we find that the 
network tends to grow by expansion of the existing hierarchical layers of computation, rather than by addition 
of new layers. We also argue that rewiring of regulatory links due to mutation/selection of novel transcription 
factor/DNA binding interactions appears not to significantly affect the network global hierarchy, and that hor- 
izontally transferred genes are mainly added at the bottom, as new target nodes. These findings highlight the 
important evolutionary roles of both duplication and selective deletion of crosstalks between autoregulators in 
the emergence of the hierarchical transcription network of E.coli. 



The successful adaptation of microorganisms to an environ- 
ment or host is determined by the correct response to external 
and internal stimuli through the simultaneous expression of a 
large set of genes. The basal mechanism that performs this 
task is transcriptional regulation, so that it becomes important 
to characterize this regulatory process from a global, or "net- 
work" viewpoint. Transcriptional regulation networks are de- 
fined starting from the basic functional elements of transcrip- 
tioni. To construct the associated graph, one usually repre- 
sents each operon with a node, and each regulatory interaction 
with a directed link A — > B between the target operon B and 
the operon A coding for a transcription factor (TF) that has 
at least one binding site in the cis-regulatory region of B. A 
transcription factor regulating its own expression is called an 
autoregulator (AR). With this definition, the interaction graph 
structure is accessible by large-scale and collections of small- 
scale experiments^! 5 .. 

Some topological and evolutionary properties of transcrip- 
tion networks have been elucidated 6 ' 7,8 . In particular, they can 
be analyzed in terms of a hierarchy of inputs that produce 
output response s 9 ' 1 ^ 11 . Specifically, the E. coli transcrip- 
tion network has an essentially feedforward layered structure, 
where feedback is mainly limited to autoregulations 9 ^. The 
abundance of the latter is, however, striking, as they con- 
cern more than half of the transcription factors^ 2 -. Here, af- 
ter quantifying the marginality of these properties with re- 
spect to a null network ensemble, we investigate how they 
could have emerged during evolution. An assessment of the 
role of gene duplication based on protein domain architecture 
shows that i) transcriptional autoregulators have mostly arisen 
through duplication, while ii) the expected feedback loops 
stemming from their initial cross-regulation, are strongly se- 
lected against. This requires a divergent coevolution of the au- 
toregulator DNA binding sites and their respective DNA cis- 
regulatory regions. Moreover, we find that the network shows 
a tendency to grow by expansion of the existing hierarchical 
layers of computation, rather than by addition of new layers. 
We also argue that de novo rewiring of regulatory links due 
to mutation/selection of novel transcription factor/DNA bind- 



ing interactions does not affect the hierarchy, and that hori- 
zontally transferred genes are mainly added at the bottom, as 
new target nodes. Our findings are consistent with a view of 
prokaryote evolution based on ancient duplications and con- 
servation of stable central parts despite widespread horizontal 
gene transfer s 13,14 . 

a. Feedback and Hierarchy. A priori, one may expect 
that transcription networks contain abundant feedback loops 
involving two or more gene s 15 ' 16 . However, for the case of 
E. coli, the available data indicate that this is not the cas o 9 ' 10 ' 11 . 
The Shen-Orr dataset 2 (423 operons; 117 TFs, 578 interac- 
tions) does not contain any non-self-regulatory feedback loop 
for the E. coli transcription network. Such a tree-like directed 
graph is naturally organized in feedforward layers of compu- 
tation, ending with target genes (TG) as "leaves". The layers 
and their numbering can be defined by the longest chain of 
(different) regulators upstream of each TF or TG in each layer 
(Figs. QJ&d). Members of layer one are regulated by at most 
themselves, members of layer two are regulated by a chain of 
one transcription factor and possibly themselves, and so on. 
There are five hierarchical layers in the Shen-Orr dataset 2 , 
which is considerably lower than for randomized null net- 
works (see Fig.QJ). About 50% of the nodes (TFs and TGs) 
lay in layer two, with 69% of all TF nodes located in layer 
one. The notable exception to this general lack of feedback 
is the substantial presence of feedback loops involving a sin- 
gle node, or autoregulators (59 ARs out of 1 17 TFs) 12 ' 17 ' 18 ' 19 . 
The more recent publicly available database RegulonDB 5.5 3 
includes larger dataset s 3,9 ' 10 (648 operons; 147 TFs, 1170 in- 
teractions, 85 ARs, excluding Sigma-factor interactions). By 
contrast with Shen-Orr dataset, it contains a few (4) non-self- 
regulatory feedback loops and a few more (a total of 7) hier- 
archical layers but still considerably less than in randomized 
null networks (see Supplementary Note S5). Hence the same 
trend is seen for both Shen-Orr and RegulonDB 5.5 3 datasets. 

To quantify the significance of regulatory feedback and hi- 
erarchical properties of the E. coli transcription network, we 
compared it for each dataset (Shen-Orr and RegulonDB 5.5) 
with randomized null networks with the same degree se- 
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quence, i.e. conserving the number of incoming and outgoing 
links for each node (Fig.Q]and Supplementary Note SI). For 
both data sets, the number of ARs found in the empirical net- 
work greatly exceeds the same quantity for randomized coun- 
terparts, confirming previous observations on self-regulatory 
feedbac k 2 ' 12 ' 19 . The importance of non-self-regulatory feed- 
back was quantified by the size of the regulatory core obtained 
after pruning the tree-like input and output cascades using 
the leaf-removal algorithm (see Fig. QJ and Supplementary 
Notes SI and S5). From this analysis, we conclude that the 
importance of transcriptional, non-self -regulatory feedback is 
significantly lower in both empirical networks (Shen-Orr and 
RegulonDB 5.5) than in their randomized network counter- 
parts, see Fig. lb and Fig. S5.10. 

The importance of hierarchy was also quantified. As there 
is no straightforward definition of hierarchy in general for net- 
works including feedback, we have used the total number of 
layers in the tree-like input and output branches of the network 
as practical definition of hierarchy. This also corresponds to 
the number of iterations of the leaf-removal algorithm (see, 
however, alternative definitions of hierarchy in Supplemen- 
tary Note S5). Note, in particular, that it correctly recov- 
ers the actual number of hierarchical layers for tree-like di- 
rected graphs (overlooking possible self-regulatory links as in 
the case of Shen-Orr dataset). Comparisons with null models 
were restricted to randomized networks with the same regu- 
latory core size. Remarkably, the number of hierarchical lay- 
ers was found to be considerably lower than in typical ran- 
domized network counterparts for both Shen-Orr and Regu- 
lonDB 5.5 datasets, see Fig. lc and Fig. S5.ll and Supple- 
mentary Note S5. 

b. Evolutionary Drives. What is the evolutionary origin 
of this peculiar structure? There are three main mechanisms 
for the evolution of a transcription network. ( 1 ) Gene duplica- 
tion, (2) rewiring of links by mutation/selection of TF/DNA 
interactions (3) horizontal gene transfer. All three mecha- 
nisms, which we discuss below in the context of transcription 
network evolution, have been shown to play a substantial role 
in prokaryote evolutio n 1,8 ' 14 ' 20 ' 21 ' 22 . For clarity, the following 
discussion refers only to the Shen-Orr dataset, which is still to 
date the most widely used dataset. The same detailed analy- 
sis on the RegulonDB 5.5 dataset is discussed in a dedicated 
section S5 in the Supplementary Note. 

Duplication. Following previous analyses 8 ' 23 , we de- 
fine proteins that are likely to share a common ancestor 
through structural domain assignments of the SUPERFAM- 
ILY database 24 . These domains allow for the definition of 
larger classes than sequence comparison alone 8 . The database 
enables to associate an ordered sequence of domains, or "do- 
main architecture" to each protein. We define protein ho- 
mologs as proteins whose domain architectures are identical 
neglecting domain repeats 29 . We have analyzed the distribu- 
tion of regulatory links between and within classes of likely 
duplicate genes. The statistical significance of the analysis in 
terms of homology classes is established 8 by comparison with 
random shufflings of genes (TFs and TGs separately) between 
classes. 
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FIG. 1: Feedback and hierarchy in E. coli transcription network, (a) 
Scheme of the layer structure of the network. Direction of regulatory 
links is from top to bottom. Each line represents a layer, populated 
by TFs (blue, thick line) and TGs (black, thin line). Members of 
layer i are regulated at most by i — 1 nodes plus themselves. By 
definition, layer one is constituted entirely by TFs. Annotations on 
the right hand side of the layers specify their population of TGs, TFs 
and ARs. (b) Evaluation of feedback with the leaf-removal algo- 
rithm. Right: illustration of the leaf-removal algorithm. Leaves are 
nodes that do not regulate any other node. Removal of one leaf and 
its regulatory links may create a new leaf. Iterative removal of leaves 
has to stop at a core of nodes that contains loops (blue, circled nodes, 
dashed links). The core might contain tree-like components upstream 
of the loops (black). Left: histogram of the number of nodes in the 
core Nc for randomized counterparts of E. colM. The data refer to 
1.1-10 6 accepted MCMC moves for randomization (see methods and 
Supplementary Note SI), (c) Histogram of the layer number in the 
randomized counterparts of the E. coli network. The average number 
of observed layers is about 12, to compare to the 5 of E. coli. The 
data correspond to a MCMC run where a total of 5.78 • 10 8 matri- 
ces where generated (of which about 1.23 ■ 10 8 were tree-like), (d) 
The flagella-building subnetwork is the only example of functional 
subnetwork that spans all the five layers. Here, this subnetwork is 
constructed arbitrarily starting from a member of layer one and fol- 
lowing the tree downstream. 



The first result, summarized in Table (see also Supple- 
mentary Table S5.1a), shows that duplicates of ARs tend to 
retain their self-links. We quantified this using two global pa- 
rameters, h ar and g ar . h ar is the average fraction of ARs in 
classes with two or more ARs. It measures the tendency to 
have many ARs in one class if two are already present (the 
reason of the cutoff is to exclude from the count classes with 
two members and only one AR). qar is the variance across 
classes of the fraction of ARs within a class. This parameter 
measures the tendency to have classes that are more populated 
than average, and at the same time classes that are less pop- 
ulated than average, which can be observed in Fig. [2^ (and 
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FIG. 2: Duplication of ARs in the E. coli transcription network, (a) 
ARs are propagated by duplication of the network (See also Table 
la), and need to develop specificity by coevolution. Top: the mech- 
anism for duplication. A is an AR. In an initial stage, the original, 
A, and its copy, A', are identical. This creates a circuit where both A 
and A' are ARs, and there is mutual crosstalk (light blue) links. Sub- 
sequent divergence can erase the links (See Supplementary Note S3). 
Bottom left: population of ARs in the homology classes in the E. coli 
network with original vs randomized domain associations. The x 
axis reports the size of each class of transcription factors, while the y 
axis indicates the fraction of autoregulators in the class. The dashed 
line corresponds to the expected value computed from the total frac- 
tion of ARs. Red dots are randomized instances. Bottom right: His- 
togram of the AR population (number of ARs in the class) of the 
largest homology class (having 15 members) for 10° randomizations 
of the SUPERFAMILY structural domains of the TFs, compared to 
the observed quantity in E. coli (diamond). In most (95%) of the ran- 
domizations the class with 15 members contains less than 11 ARs, 
indicating that duplication is likely, (b) Layers tend to be populated 
by members of the same homology class. Comparison with random- 
izations of the structural domain associations of all the genes. The 
x axis reports the total number of gene pairs of the same homol- 
ogy class belonging to the same layer. The histogram represents the 
randomized case, while the diamond indicates the observed value in 
E. coli. 



Supplementary Fig. S2.5). In spite of this strong evidence 
for the proliferation of ARs through duplication events, we al- 
ready mentioned the absence of any two-node feedback loops 
between homologous (or non-homologous) ARs 30 . This re- 
quires that the initial cross-regulation between duplicated ARs 
(reflecting the fact that binding sites are initially identical) is 
systematically suppressed even if self-regulation is conserved 
for both TF copies (Fig. |2^). We also find that single regu- 
latory links between any kind of TFs in the same homology 
class are very scarce and always involve at least one AR (see 
Fig. S2.7). On average, 91% of the links within a homology 
class of TFs are self-links. 

A simple duplication-divergence model (Fig. [2^ and Sup- 
plementary Note S3) shows that the concomitant conserva- 
tion of self-links and cancellation of cross-talks between du- 
plicated ARs require a selective pressure for evolutionary de- 
coupling. This can be achieved through divergent coevolu- 
tion*^ of duplicate TF/DNA binding interactions. For in- 
stance, a straightforward analysis of the binding sites of CRP 
and FNR, two duplicate ARs regulating many TGs having no 
cross-regulation, shows that their own DNA cis-regulatory re- 
gions have higher specificities than the cis-regulatory regions 
of most of their TGs (See Supplementary Note S4), which 



suggests decoupling of their self-regulatory links. 

Layer Hierarchy and Rewiring. As shown in 8 , a large 
fraction of the non-self-regulatory links of the E. coli tran- 
scription network likely originated from duplication events. 
Indeed, many pairs of TGs from the same homology class are 
regulated by a common TF; likewise, many homologous TFs 
regulate the same TGs, and many pairs of TFs from the same 
homology class regulate homologous pairs of TGs. Clearly, 
the likely duplication events underlying this transcription net- 
work expansion conserve the number of TFs upstream of each 
target, hence leaving the layer hierarchy untouched. The only 
duplication event that can actually add a layer is the duplica- 
tion of an AR, provided that a crosstalk is conserved. A com- 
parison of the homology classes with the populations of the 
network layers (Fig. [2}}, Table|I]3, Supplementary Table S5. la, 
and Supplementary Note S2), shows that globally genes of the 
same homology class tend to populate the same layer. 

In fact, we find only 5 non-self-regulatory links within 
homology classes (see Supplementary Fig. S2.7) and they 
all involve at least one AR, suggesting that they originated 
from duplication events of an AR. For example, the histone- 
like autoregulator H-NS, belonging to layer 2, regulates its 
homolog StpA, which belongs to layer 3 (Supplementary 
Fig. S2.7). Yet, the coincidence between the number of non- 
self -regulatory links within homology classes and the number 
of hierarchical layers in E. coli, does not allow to conclude 
that the layers were generated by AR duplication events. Ev- 
idence for some presumed rewiring of regulatory links also 
exists. For instance, the same AR H-NS (Supplementary 
Fig. S2.7) is also regulated by the cold shock protein CspA, 
which neither regulates any homologs of H-NS, nor has any 
homolog itself in the dataset. It is thus likely that this incom- 
ing regulatory link of H-NS does not come from duplication, 
but rather, from rewiring. Thus rewiring could also be a mech- 
anism for creation of new computational layers. However, we 
find also indications that de novo rewiring of regulatory links 
is limited by the network hierarchy. With respect to random- 
ized instances, there is smaller dispersion of TG homology 
classes over multiple layers of computation than observed for 
TF homology classes. This can be quantified for example by 
the Z-score of the number of gene pairs in the same layer and 
class; the higher this quantity, the more duplication dominates 
on rewiring. We find Z = 1 for layer one (entirely made 
of TFs), while Z = 4.6 for layer two (dominated by TGs). 
This is consistent with an evolutionary scenario leading to an 
early structuration of the transcriptional network into a few 
hierarchical layers of computation (from duplication of ARs 
and limited rewiring as well) followed by a primarily lateral 
expansion of TGs (mostly by duplication). 

Altogether, these observations lead us to conclude that 
maintaining a "shallow" layer structure, where most of the 
computation is performed at the single layer level, seems to 
be important for the E. coli transcription network. A possible 
rationale for this fact is that the time taken by a computational 
cascade involving multiple layers is expected to be roughly 
proportional to the number of layers 26 . Thus, since the net- 
work has to react to a particular stimulus or environment by 
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"switching on" the proper genes without unnecessary delays, 
having many layers might be disadvantageous. For this rea- 
son, it could be interesting to target studies to the sub-systems 
that make use of multi-layer computation (Fig.QJ). 

TABLE I: Evaluation of different evolutionary drives (see also Sup- 
plementary Table S5.1). (a) The table shows that duplicates of ARs 
tend to retain their self-links. This is quantified globally by the ob- 
servables h ar , the average fraction of ARs in classes with two or 
more ARs, and g ar , measuring the spread in the AR population 
among classes that can be observed in Fig. [2^ and Supplementary 
Fig. S2.5.(b) Duplication and divergence preserve the layer structure. 
The first column indicates distance between layers (defined as the 
absolute difference in layer numbers), while the second and the third 
correspond to the population of duplicate genes (genes in the same 
homology class) at that distance, in 10 5 instances with randomized 
domain associations (average values) and the E .coli domain associa- 
tion dataset respectively. For example, the first row (pairs of genes at 
distance zero) concerns the number of duplicate genes which occupy 
the same layer (see Fig.|2p and Supplementary Note S2). The sketch 
in the right panel illustrates the distribution of nodes belonging to the 
same class of TFs (cyan) or TGs (yellow) among the layers, and the 
definition of distance between layers, (c) Fate of gene gains from 
horizontal transfer. TFs are underrepresented both in the class of 
gene gains (columns 2 and 3) and in the class of gene gains that have 
at least a paralog in the homology classes constructed with domain 
associations (columns 5 and 6). 
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Horizontal Gene Transfer. Finally, let us focus on hor- 
izontal gene transfer. We investigated the role of transferred 
genes with respect to their position in the network and in the 
homology classes (TableQ];, Supplementary Table S5.1b). For 
this purpose, we used lists of genes likely to be transferred 
in E. coli from ref^ 4 -. These lists were obtained by a phy- 
logenetic tree reconstruction based on 51 bacterial species. 
With a gain/loss penalty of two, 29% of the genes in the 
network are classified as gene gains. We find that most of 
the gene gains are target genes. Comparison with a sim- 
ple binomial null model shows that most TFs are not likely 
to have been horizontally transferred, while transferred TGs 
are abundant. Hence, one can conclude that in analogy with 
E. coli metabolic network 14 , imported genes are mainly found 
at the "periphery" of the network. Furthermore, transferred 
TGs are not found in large homology classes, defining instead 



mostly single-gene classes, suggesting that gene duplications 
preceded many horizontal gene transfers. Overall, this is con- 
sistent with a view of prokaryote evolution based on ancient 
duplications and conservation of a "stable genetic core" de- 
spite widespread horizontal gene transfer a 13 ' 14 . 

In conclusion, our findings confirm the importance of 
(probably ancient) duplications for the evolution of this net- 
work, and pinpoint to some important trends due to selective 
pressure and evolutionary dynamics, namely, preservation of 
ARs and cancellation of crosstalks, as well as a propensity for 
a feedforward structure with a small number of computational 
layers. The layered hierarchy of E. coli transcription network 
appears to have first emerged and laterally expanded from du- 
plication of a few ARs. Overall, this supports an evolutionary 
scenario based on duplication 8 (with duplicates occupying the 
same layer) and selective deletion of crosstalks between au- 
toregulators (which would otherwise increase the number of 
hierarchical layers). Further duplication-driven lateral expan- 
sions of TG homology classes have then taken place together 
with widespread horizontal gene transfers of new TGs. 

METHODS 

c. Datasets. We used the Shen-Orr and RegulonDB 5.5 
datasets for the transcription network 2,3 . Domain architecture 
data were taken from the SUPERFAMILY database 24 , version 
1.61, as in the datasets of ref. 8 . More recent versions of SU- 
PERFAMILY (we tested 1.69) or the transcription network 3 
do not change the conclusions. The dataset of likely horizon- 
tally transferred genes was generously provided by the authors 
of refpl 4 -. Finally, the binding sites for the clustering analysis 
FNR and CRP (see Supplementary Note S4) were taken from 
the regulonDB 3 dataset. 

d. Network Analysis. We used Fortran 77 implementa- 
tions of different variants (see Supplementary Note SI) of the 
leaf-removal algorithms on the Shen-orr data-set (including 
ARs) and its randomized counterparts, which were obtained 
using a standard Markov Chain Monte Carlo (MCMC) algo- 
rithm that preserves the degree sequence (marginals of the 
adjacency matrix) 27 . This algorithm is best formulated for 
the adjacency matrix of the graph, i.e. the matrix A such as 
(A)ij = 1 if i — > j, and otherwise. We considered un- 
structured counterparts of A. Randomizations with no self- 
links or structurally zero diagonal of A, lead to different re- 
sults. For all the tree-like instances, the number of layers cor- 
respond to the (whole-graph) iterations that are necessary for 
the leaf-removal algorithm to remove the entire graph. In or- 
der to consider a significant sample, the number of MCMC 
iterations was calibrated according to the number of accepted 
MCMC moves 27 . Specifically, we stopped the algorithm after 
T = Kt accepted moves, where r is the number of nonzero 
elements of A, and T = 2000. 

e. Evaluation of Duplications. We constructed classes 
of homologous genes using similarity criteria of the SUPER- 
FAMILY domain architecture. Results given in the body of 
the paper refer to the case where two genes are considered ho- 
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mologs if they share the same domains in the same order, ne- 
glecting domain repeats. A gap is considered a domain. Dif- 
ferent choices lead to very similar results (see Supplementary 
Note S2). For this analysis, proteins coded by the same operon 
were considered as separate entities. Many classes generated 
this way, such as {CRP, FNR}, are supported by evidence 
based on protein sequence comparison. The classes of pro- 
teins obtained this way were compared with TF-TG links in 
the transcription network data-set. Observations related to 
these classes were compared to randomizations that shuffle 
domain associations to gene names, separately for TFs and 
TGs 8 . The data given in the body of the paper correspond to 
10 5 randomizations. 

/ Graph Growth Model. A simple model of 
duplication-divergence was considered, where at each 
time step duplication of the graph is followed by cancellation 
of links with prescribed probabilities (Supplementary Note 
S3). We analyzed the evolution equations for the fraction 
of ARs and of intra-class links, in the different scenarios of 
symmetric and asymmetric divergence, presence or absence 
of selective conservation of ARs, presence or absence of 
constant inflow of ARs. The results were compared with the 
observed trends in the data. 

g. Analysis of Horizontal Gene Transfers. We used lists 
of imported genes obtained by a phylogenetic tree reconstruc- 
tion based on 51 bacterial species 14 . We presented results 
obtained with a gain/loss penalty of two and the hypothesis 
of retarded transfer, or "DELTRANS" assumption. Different 
choices lead to similar results (data not shown). To evaluate 



the partition of transferred genes between TFs and TGs, we 
compared with a simple binomial model where the probabil- 
ity of import is given by the total fraction of imported genes. 
As a null model for the number of imported genes that ap- 
pear in homology classes, we considered classes generated by 
shuffling associations of genes with domain architectures as 
above. 

h. Specificity of TF Binding Sites. Binding sites of two 
duplicate TFs were scored against their logos 28 , obtained with 
the list of all available binding sites from RegulonDB. The 
specificity was defined as the difference between the scores 
of the same binding sites on two different logos. To im- 
prove the sensitivity, logos were computed keeping into ac- 
count reverse-complement sequences and the entropy of mix- 
ing of the sets of binding sites of the two TFs under exam (see 
Supplementary Note S4). 
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(Supplementary Note S2) 

This is not strictly true for the more recent RegulonDB 5.5 dataset, 
where a few of these two-node feedback loops are observable, 
though the signature for negative selection remains (see Supple- 
mentary Note S5) 



